The effect of OCR errors on stylistic text classification
Internal identifier: 001147 (Main/Exploration); previous: 001146; next: 001148
Authors: Sterling Stuart Stein [United States]; Shlomo Argamon [United States]; Ophir Frieder [United States]
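French descriptors
- Pascal (Inist): Reconnaissance caractère, Reconnaissance optique caractère, Analyse contenu, Recherche information, Texte, Classification, Analyse texte.
- Wicri:
- topic: Classification.
English descriptors
- KwdEn: Character recognition, Classification, Content analysis, Information retrieval, Optical character recognition, Text, Text analysis.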
Abstract
Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000361
- to stream PascalFrancis, to step Curation: 000425
- to stream PascalFrancis, to step Checkpoint: 000306
- to stream Main, to step Merge: 001167
- to stream Main, to step Curation: 001147
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">The effect of OCR errors on stylistic text classification</title>
<author><name sortKey="Stein, Sterling Stuart" sort="Stein, Sterling Stuart" uniqKey="Stein S" first="Sterling Stuart" last="Stein">Sterling Stuart Stein</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Argamon, Shlomo" sort="Argamon, Shlomo" uniqKey="Argamon S" first="Shlomo" last="Argamon">Shlomo Argamon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Frieder, Ophir" sort="Frieder, Ophir" uniqKey="Frieder O" first="Ophir" last="Frieder">Ophir Frieder</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">06-0519586</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 06-0519586 INIST</idno>
<idno type="RBID">Pascal:06-0519586</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000361</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000425</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000306</idno>
<idno type="wicri:Area/Main/Merge">001167</idno>
<idno type="wicri:Area/Main/Curation">001147</idno>
<idno type="wicri:Area/Main/Exploration">001147</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">The effect of OCR errors on stylistic text classification</title>
<author><name sortKey="Stein, Sterling Stuart" sort="Stein, Sterling Stuart" uniqKey="Stein S" first="Sterling Stuart" last="Stein">Sterling Stuart Stein</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Argamon, Shlomo" sort="Argamon, Shlomo" uniqKey="Argamon S" first="Shlomo" last="Argamon">Shlomo Argamon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Frieder, Ophir" sort="Frieder, Ophir" uniqKey="Frieder O" first="Ophir" last="Frieder">Ophir Frieder</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Classification</term>
<term>Content analysis</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Text</term>
<term>Text analysis</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Analyse contenu</term>
<term>Recherche information</term>
<term>Texte</term>
<term>Classification</term>
<term>Analyse texte</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Classification</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Illinois</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Illinois"><name sortKey="Stein, Sterling Stuart" sort="Stein, Sterling Stuart" uniqKey="Stein S" first="Sterling Stuart" last="Stein">Sterling Stuart Stein</name>
</region>
<name sortKey="Argamon, Shlomo" sort="Argamon, Shlomo" uniqKey="Argamon S" first="Shlomo" last="Argamon">Shlomo Argamon</name>
<name sortKey="Frieder, Ophir" sort="Frieder, Ophir" uniqKey="Frieder O" first="Ophir" last="Frieder">Ophir Frieder</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001147 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001147 | SxmlIndent | more
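Once a record has been extracted, its fields can be read with any XML parser. The following is a minimal sketch in Python, not part of Dilib. It assumes a trimmed-down copy of the record (the `SAMPLE_RECORD` string and the `summarize_record` helper are hypothetical names introduced here) that keeps only unprefixed elements. The full record uses the `wicri:` and `inist:` prefixes without declaring them, so a namespace-aware parser would likely reject it unless those prefixes are declared or stripped first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down copy of the record shown on this page.
# Elements and attributes with the undeclared wicri:/inist: prefixes are left out.
SAMPLE_RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">The effect of OCR errors on stylistic text classification</title>
<author><name first="Sterling Stuart" last="Stein">Sterling Stuart Stein</name></author>
<author><name first="Shlomo" last="Argamon">Shlomo Argamon</name></author>
<author><name first="Ophir" last="Frieder">Ophir Frieder</name></author>
</titleStmt></fileDesc>
<profileDesc><textClass>
<keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term><term>Classification</term></keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Classification</term></keywords>
</textClass></profileDesc></teiHeader></TEI></record>"""


def summarize_record(xml_text):
    """Return the title, author names, and keywords (grouped by scheme) of a record."""
    root = ET.fromstring(xml_text)
    title_stmt = root.find(".//titleStmt")
    title = title_stmt.findtext("title")
    authors = [name.text for name in title_stmt.findall("author/name")]
    # One entry per <keywords> element, keyed by its "scheme" attribute.
    keywords = {
        kw.get("scheme"): [term.text for term in kw.findall("term")]
        for kw in root.iter("keywords")
    }
    return {"title": title, "authors": authors, "keywords": keywords}


summary = summarize_record(SAMPLE_RECORD)
print(summary["title"])
print("; ".join(summary["authors"]))
for scheme, terms in summary["keywords"].items():
    print(scheme, "->", ", ".join(terms))
```

Reading only from `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.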
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:06-0519586 |texte= The effect of OCR errors on stylistic text classification }}
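When many records need such links, the template call can be assembled programmatically. The sketch below is an illustration only: `explor_link` is a hypothetical helper, and the parameter names and spacing are simply copied from the template call shown above rather than taken from any Wicri documentation.

```python
def explor_link(wiki, area, flux, etape, rbid, texte):
    """Build an 'Explor lien' template call in the layout used on this page."""
    fields = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", "RBID"),  # this page links by RBID; other key types are not covered here
        ("clé", rbid),
        ("texte", texte),
    ]
    body = " ".join(f"|{key}= {value}" for key, value in fields)
    return "{{Explor lien " + body + " }}"


link = explor_link(
    wiki="Ticri/CIDE",
    area="OcrV1",
    flux="Main",
    etape="Exploration",
    rbid="Pascal:06-0519586",
    texte="The effect of OCR errors on stylistic text classification",
)
print(link)
```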
This area was generated with Dilib version V0.6.32.